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Abstract 

In [23] we showed that a large class of fast recursive matrix multiplication algorithms is stable in a 
normwise sense, and that in fact if multiplication of n-by-n matrices can be done by any algorithm in 
0(n u+rl ) operations for any n > 0, then it can be done stably in 0(n UJ+rl ) operations for any n > 0. 
Here we extend this result to show that essentially all standard linear algebra operations, including 
LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares 
problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably 
(in a normwise sense) in 0(n UJ+rl ) operations. 

1 Introduction 

Matrix multiplication is one of the most fundamental operations in numerical linear algebra. Its importance 
is magnified by the number of other problems (e.g., computing determinants, solving systems of equations, 
matrix inversion, LU decomposition, QR decomposition, least squares problems etc.) that are reducible to 
it [TU [3TJ [11]. This means that an algorithm for multiplying n-by-n matrices in 0(n u ) operations can be 
converted into an algorithm for these other linear algebra operations that also runs in O(n^) operations. 

In this paper we extend this result to show that if the matrix multiplication algorithm is stable in a 
normwise sense discussed below, then essentially all linear algebra operations can also be done stably, in 
time 0(n u ) or 0(n^ +r ') : for arbitrary n > 0. For simplicity, whenever an exponent contains "+rj" , we will 
henceforth mean "for any n > 0." 

In prior results |23] we showed that any fast matrix multiplication algorithm running in time 0{n u+ri ) was 
either stable or could be converted into a stable algorithm that also ran in 0(n w+? ') operations. Combined 
with the results in this paper, this lets us state that all linear algebra operations can also be done stably in 
0(rf J+ri ) operations. 

More precisely, some of our results (see Theorem 13.31 in Section [3]) may be roughly summarized by saying 
that n-by-n matrices can be multiplied in 0(n" +r> ) operations if and only if n-by-n matrices can be inverted 
stably in 0{n u+TI ) operations. We need to use a little bit of extra precision to make this claim, and count 
operations carefully; the cost of extra precision is accounted for by the 0(n v ) factor. 

Other results (see Section [4]) may be summarized by saying that if n-by-n matrices can be multiplied 
in 0(n w+r ') arithmetic operations, then we can compute the QR decomposition stably (and so solve linear 
systems and least squares problems stably) in 0(n w+r ') arithmetic operations. These results do not require 
extra precision, which is why we only need to count arithmetic operations, not bit operations. 

The QR decomposition will then be used to stably compute a rank-revealing decomposition, compute 
the (generalized) Schur form, and compute the singular value decomposition, all in 0(n w+v ) arithmetic 
operations. To compute (generalized) eigenvectors from the Schur form we rely on solving the (generalized) 
Sylvester equation all of which can be done stably in 0(n" +I ') bit operations. 
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Now we become more precise about our notions of stability. We say an algorithm for multiplying n-by-n 
square matrices C — A- B is stable if the computed result C comp satisfies the following normwise error bound: 

\\C comp - C\\ < n{n)e\\A\\ \\B\\ + 0(e 2 ), (1) 

where e is machine epsilon (bounds the roundoff error) and /x(n) is a (low degree) polynomial, i.e., p,(n) — 
0(n c ) for some constant c. Note that one can easily switch from one norm to another at the expense of 
picking up additional factors that will depend on n, using the equivalence of norms on a finite-dimensional 
space, thereby changing the constant c slightly. The bound (JTJ) was first obtained for Strassen's 0(n 2 S1 ) 
algorithm i49] by Brent ([T21 [33], [3H chap. 23]) and extended by Bini and Lotti [5] to a larger class 
of algorithms. In prior work [23j we showed that such a bound holds for a new class of fast algorithms 
depending on group-theoretic methods [T3] and |17) . which include an algorithm that runs asymptotically 
as fast as the fastest known method due to Coppersmith and Winograd [TH], which runs in about 0(n 2 38 ) 
operations. Using a result of Raz [43] , that work also showed that any fast matrix multiplication algorithm 
running in 0(n u+v ) arithmetic operations can be converted to one that satisfies ([1]) and also runs in 0(n' jj+ri ) 
arithmetic operations. 

In Section[5]we begin by reviewing conventional block algorithms used in practice in libraries like LAPACK 
PQ and ScaLAPACK [TO] . The normwise backward stability of these algorithms was demonstrated in [531 
1351 [231 134] using fTJ as an assumption; this means that these algorithms are guaranteed to produce the exact 
answer (e.g., solution of a linear system) for a matrix C close to the actual input matrix C, where close 
means close in norm: \\C — C\\ = 0(e)\\C\\. Here the 0(e) is interpreted to include a factor n c for a modest 
constant c. 

What was not analyzed in this earlier work was the speed of these block algorithms, assuming fast matrix 
multiplication. In Section [5] we show that the optimal choice of block size lets these block algorithms run 
only as fast as 0(n 4 -^ ) operations, where 0(n 7 ) is the operation count of matrix multiplication. (We use 
7 instead of ui + r\ to simplify notation.) Even if 7 were to drop from 3 to 2, the exponent would only 
drop from 3 to 2.5. While this is an improvement, we shall do better. 

In Section [3J wc consider known divide-and-conquer algorithms for reducing the complexity of matrix 
inversion to the complexity of matrix multiplication. These algorithms are not backward stable in the 
conventional sense. However, we show that they can achieve the same forward error bound (bound on the 
norm of the error in the output) as a conventional backward stable algorithm, provided that they use just 
0(log p n) times as many bits of precision in each arithmetic operation (for some p > 0) as a conventional 
algorithm. We call such algorithms logarithmically stable. Incorporating the cost of this extra precise 
arithmetic in the analysis only increases the total cost by a factor at most log 2p n. Thus, if there are matrix 
multiplication algorithms running in 0(n UJ+n ) operations for any 77 > 0, then these logarithmically stable 
algorithms for operations like matrix inversion also run in 0(n" +11 ) operations for any 77 > 0, and achieve 
the same error bound as a conventional algorithm. 

In Section 14.11 we analyze a divide-and-conquer algorithm for QR decomposition described in |27j that 
is simultaneously backward stable in the conventional normwise sense (i.e. without extra precision), and 
runs in 0(n UJ+ri ) operations for any rj > 0. This may be in turn used to solve linear systems, least squares 
problems, and compute determinants equally stably and fast. We apply the same idea to LU decomposition 
in Section [4.2l but stability depends on a pivoting assumption similar to, but slightly stronger than, the usual 
assumption about the stability of partial pivoting. 

In Section [5] we use the QR decomposition to compute a rank revealing U RV decomposition of a matrix 
A. This means that U and V are orthogonal, R is upper triangular, and R reveals the rank of A in the 
following sense: Suppose a\ > •• • > ay are the singular values of A. Then for each r, cr m i n (R(l : r, 1 : r)) is 
an approximation of ay and cr max (R(r + 1 : n, r + 1 : n)) is an approximation of oy+i. (Note that if R were 
diagonal, then the URV decomposition would be identical to the singular value decomposition, and these 
approximations would be exact.) Our algorithm will be randomized, in the sense that the approximations of 
ay and ay+i are reasonably accurate with high probability. 

In Section 16.11 we use the QR and URV decompositions in algorithms for the (generalized) Schur form 
of nonsymmetric matrices (or pencils) [5], lowering their complexity to 0(n^ +ri ) arithmetic operations while 
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maintaining normwise backward stability. The singular value decomposition may in turn be reduced to 
solving an eigenvalue problem with the same complexity fSection l6.2p . Computing (generalized) eigenvectors 
can only be done in a logarithmically stable way from the (generalized) Schur form. We do this by providing 
a logarithmically stable algorithm for solving the (generalized) Sylvester equation, and using this to compute 
eigenvectors. A limitation of our approach is that to compute all the eigenvectors in 0(n" +,) ) bit operations, 
all the eigenvectors may in the worst case have a common error bound that depends on the worst conditioned 
eigenvector. 



2 Conventional Block Algorithms 



A variety of "block algorithms" that perform most of their operations in matrix multiplication are used in 
practice [1] [10] and have been analyzed in the literature [24 , 33 , 25] [34] , and it is natural to consider these 
conventional algorithms first. For example, [53] does a general error analysis of block algorithms for LU 
factorization, QR factorization, and a variety of eigenvalue algorithms using the bound (JTJ) , and shows they 
are about as stable as their conventional counterparts. What was not analyzed in [24j was the complexity, 
assuming fast matrix multiplication. 

We will use the notation MM(p,q,r) to mean the number of operations to multiply a p-by-q times a 
g-by-r matrix; when q < p,r, this is done by ^ • £ multiplications of q-by-q matrices, each of which costs 
MM(q, q, q) = 0(q' y ) for some 2 < 7 < 3. (We use 7 instead of uj + r\ in order to simplify notation.) Thus 
MM(p,q,r) = 0(pq 7 ~ 2 r). Similarly, when p is smallest MM(p,q,r) = 0(p~ f ~ 2 qr), and so on. Also we will 
abbreviate MM(n, n, n) = MM (n). 

Consider block LU factorization with pivoting. Given a choice of block size 6, the algorithm breaks 
the n-by-n matrix A into blocks of b columns, then LU factorizes each such block using the conventional 
algorithm, and then updates the trailing part of the matrix using fast matrix multiplication. This may be 
expressed as 



A = 
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where An and Ln are b-hy-b, and P is a permutation. Thus the steps of the algorithm are as follows: 

Uu. 



a. Factor 



An 
A 2 i 



P 



Ln 
L21 



b. Apply P T to the rest of the matrix columns (no arithmetic operations). 

c. Solve the triangular system LnUyi = -A 12 for ^12- 

d. Update the Schur complement A22 — A22 ~ £21^12- 

e. Repeat the procedure on ^22- 



2T)] for steps (a) and (c) costs 0(nb 2 ). Step (d) involves matrix multiplication 
0(n 2 6 7 ~ 2 ). Repeating these steps n/b times makes the total cost 0(n 2 b + 



The conventional algorithm [2z 
at a cost MM(n — b,b,n — b) 
n 3 b~<- 3 ). 

9 00.. 1 9-2-y 

To roughly minimize this cost, choose b to make n b = n b 1 ~ , yielding b = n 4 -~< and jfcops = 0(n 4 -~< ). 

For 7«3, the cost is near the usual 0(n 3 ), but as 7 decreases toward 2, b drops to n 1 / 2 but the #ops only 

drops to 0(n 2 - 5 ). 

This same big-0 analysis applies to QR factorization: When A is n-by-m and real and n > m, then 
we can write A — QR where Q is n-by-n and orthogonal and R is n-by-m and upper triangular. We will 
represent Q compactly by using the WY representation of Q [9]: Q T can be written Q T = I — WY, where 
W and Y T are both n-by-m and lower triangular, W's columns all have 2-norm equal to 1, and Y's columns 
all have 2-norm equal to 2. (An even lower- memory version of this algorithm [33] is used in practice [Tl 110]. 
but we use [9] for simplicity of presentation.) The conventional algorithm using the WY representation or 
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variations costs 0(nm 2 ) operations to compute R, 0(nm 2 ) operations to compute W and Y, and 0(n 2 m) 
operations to explicitly construct Q, or 0(n 3 ) in the square case [9]. 

The algorithm for block QR factorization is entirely analogous to block LU factorization, processing 
the matrix in blocks of b columns at a time, updating the trailing part of the matrix using fast matrix 
multiplication, based on the following identity, where A\ is n-by-6 and A 2 is n-by-(m — b): 



A = [A 1 ,A 2 ] = [Q 1 R 1 ,A 2 } = Q 1 [R 1 ,Q^A 2 ] 

= Q 1 [R 1 ,(I-W 1 Y 1 )A 2 ] = Q 1 [R 1 ,A 2 -W 1 (Y 1 A 2 )] = Q 1 [R 1 ,A 2 ] 

where Qj = I — W\Y\. The cost of this step is 0(nb 2 ) for A\ = Q\R\ plus 

MM(b,n,m - b) + MM(n,b,m- b) + n(m -b) = 0{nmb"<~ 2 ) 



(3) 



for A 2 . Repeating this procedure (to — b) jb times on the last n — b rows of A 2 = 
A 22 = Q 2 R 2 = (I - W 2 Y 2 ) T R 2 . Combining this with ^ yields 
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eventually yields 



A = Qi 



Rn 
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Q2R2 



= Qi 
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Q 2 



R11 A12 
R 2 



= Qi • Q2 ■ R = Q ■ R 



(4) 



In practice we leave Q — Qi ■ Q 2 in this factored form (note the Q 2 will also be a product of factors) since 
that is faster for subsequent purposes, like solving least squares problems. Thus the cost is bounded by 
(to/6) times the cost of ©, namely Oinmb + nm 2 ^ -3 ). When n = to, this is the same cost as for block 
Gaussian elimination. In the general case of to < n, we again roughly minimize the cost by choosing b so 
nmb — nm 2 b' 1 ~ 3 , namely b — m 1 ^ A ~~ < \ leading to a cost of 0(nm i -~< ). As 7 drops from 3 toward 2, this 



cost drops from 0(nm 2 ) toward 0(r 



,1.5 



If we wish, we may also multiply out the Qf factors into a matrix of the form / — WY where W and Y T 
are n-by-m and lower triangular. Getting W and Y costs 0(nm 2 6 7-3 ), and multiplying out / — WY costs 
an additional 0(n 2 mb' 1 ~' 3 ). This does not change the cost in a big-0 sense when m/n is bounded below. 
The following equation shows how: 



Q 1 



(i 





w 2 



[0,Y 2 ])-(I-Wi-Yi) 



= {I-W 2 -Y 2 )-{I-W x -Yx) 

= (J-[Qf •Wi,W 2 ]-[F i; y 2 ]) 

= (I- [W t - W 2 ■ (Y 2 ■ Wi), W 2 ] ■ [Yx-,%]) 

= I -WY 

(here we have used Matlab notation like [Yi; Y2] to stack Y\ on top of Y 2 ). Now the cost minimizing b leads 
to a cost of 0(n 4 --i to). 

In summary, conventional block algorithm guarantee stability but can only reduce the operation count 
to 0(n 2 5 ) even when matrix multiplication costs only 0(n 2 ). To go faster, other algorithms are needed. 



3 Logarithmically Stable Algorithms 

Our next class of fast and stable algorithms will abandon the strict backward stability obtained by con- 
ventional algorithms or their blocked counterparts in the last section in order to go as fast as matrix mul- 
tiplication. Instead, they will use extra precision in order to attain roughly the same forward errors as 
their backward stable counterparts. We will show that the amount of extra precision is quite modest, and 
grows only proportionally to logn. Depending on exactly how arithmetic is implemented, this will increase 
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the cost of the algorithm by only a polylog(n) factor, i.e. a polynomial in \ogn. For example, if matrix 
multiplication costs 0(n 7 ) with 2 < 7 < 3, then for a cost of 0(n y polylog(n)) = ©(n 7 ^) for arbitrarily 
tiny r\ > one can invert matrices as accurately as a backward stable algorithm. We therefore call these 
algorithms logarithmically stable. 

To define logarithmic stability more carefully, suppose we are computing y — f(x). Here x could denote a 
scalar, matrix, or set of such objects, equipped with an appropriate norm. For example, y — f({A, b}) — A~ 1 b 
is the solution of Ay — b. Let Kf(x) denote the condition number of /(), i.e. the smallest scalar such that 

\\f(x + Sx)-f(x)\\ \\6x\\ (\\8x\\\\ 

Let alg(x) be the result of a backward stable algorithm for f(x), i.e. alg(x) = f(x + Sx) where \\Sx\\ = 
0(e)||x||. This means the relative error in alg(x) is bounded by 

\\alg(x) - f(x)\\ _ ru ^^ f ^ , ^^ 2 N 



0(e)K f (x) +0{s 2 



Definition 3.1 (Logarithmic Stability). Let algi s {x) be an algorithm for f(x), where the "size" (e.g., 
dimension) of x is n. If the relative error in algi s (x) is bounded by 

\^9is(x)-f(x)\\ = x{n) + 2 

ll/wll 1 

where x( n ) > 1 is bounded by a polynomial in logn, then we say algi s (x) is a logarithmically stable algorithm 
for f(x). 

Lemma 3.2. Suppose algi s {x) is a logarithmically stable algorithm for f{x). The requirement that algi s (x) 
compute an answer as accurately as though it were backward stable raises its bit complexity only by a factor 
at most quadratic in x(n), i.e. polynomial in logn. 

Proof. A backward stable algorithm for f(x) running with machine precision e^s would have relative error 
bound 0(ei, s )Kf(x) = t. A relative error bound is only meaningful when it is less than 1, so we may assume 
r < 1. Taking logarithms yields the number of bits bb s of precision needed: 

hs = log 2 — = log 2 - + log 2 K f (x) + 0(1) . (6) 

£bs T 

Recall that each arithmetic operation costs at most 0(bl s ) bit operations and as few as 0(bb s log 6& s log log 6& s ) 
if fast techniques are used [45] , 

To make the actual error bound for algi s (x) as small as r means we have to choose £; s to satisfy 

0(ei s )K > f (x) = t. Again taking logarithms yields the number of bits 6; s of precision needed: 

b ls = log 2 — = log 2 - + X (n) ■ log 2 K f (x) + 0(1) < x{n)b bs + 0(1) (7) 

This raises the cost of each arithmetic operation in algi s (x) by a factor of at most 0(x 2 ( n )) as claimed. 

Thus, if algisix) were backward stable and performed 0(n c ) arithmetic operations, it would cost at 
most 0(n c bl s ) bit operations to get a relative error r < 1. Logarithmic stability raises its cost to at most 
0(n c x 2 {n)bl s ) bit operations to get the same relative error. □ 

3.1 Recursive Triangular Matrix Inversion 

First we apply these ideas to triangular matrix inversion, based on the formula 

— 1 rji — 1 rp rji — 1 

11 • J 12'J 22 
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where T\\ and T 22 are §-by-^ and inverted using the same formula recursively. The cost of this well-known 
algorithm [IT] [31] is 

cost{n) = 2 cost{n/2) + 2MM {n/2, n/2, n/2) = 0(n 7 ). 

Its error analysis in [32] (Method A in Section 6) used the stronger componentwise bound [34] [eqn. (3.13)] 
that holds for conventional matrix multiplication (as opposed to ([1])) but nevertheless concluded that the 
method was not as stable as the conventional method. (The motivation for considering this algorithm in [32] 
was not fast matrix multiplication but parallelism, which also leads to many block algorithms.) 

To show this algorithm is logarithmically stable, we do a first order error analysis for the absolute error 
err(T _1 , n) in the computed inverse of the n-by-n matrix T . We use the fact that in computing the product 
of two matrices C — A ■ B that have inherited errors err{A,n) and err(B,n) from prior computations, we 
may write 

err(C,n) — n(n)e\\A\\ ■ \\B\\ from matrix multiplication 

+ • err(B,n) amplifying the error in B by \\A\\ (8) 

+err(A,n) ■ \\B\\ amplifying the error in A by ||B|| 

We will also use the facts that \\T U \\ < \\T\\ (and < HT^H) since T it is a submatrix of T (and T^ 1 is 

a submatrix of T -1 ). Therefore the condition number k(Tu) = \\Tu\\ • \\T^ \\ < k(T). Now let err(n') be a 
bound for the normwise error in the inverse of any n'-by-n' diagonal subblock of T encountered during the 
algorithm. Applying ([8]) systematically to the recursive algorithm yields the following recurrences bounding 
the growth of err(n). (Note that we arbitrarily decide to premultiply Tyi by T^ 1 first.) 

err(T i ~ 1 , n/2) < err(n/2) ... from inverting T\\ and T22 

err{T^ 1 ■ T 12 ,n/2) < n{n/ '2)e\\T^\\ • ||T 12 || + err(T n \ n/2) \\T 12 \\ 

... from multiplying • Tyi 
< M(n/2)e||T- 1 || • ||T|| + err(n/2)||T|| 
err{{T^ ■T 12 )-T 2 7 2 \n/2) < n{n/2)e\\T^ ■ T 12 || • \\T£\\ 



-err{T^ • T 12 , n/2)- || T£ \\ 
-WT^ 1 ■T l 2\\-err{T^,n/2) 



from multiplying (T^ 1 • Ti 2 ) • T 22 



< n{n/2)e\\T^\\ • ||T|| • HT- 1 !! 

+ ( A1 (n/2) £ ||T- 1 || • \\T\\+err(n/2)\\T\\) ■ WT^W 
+ \\T^\\ ■ \\T\\ ■ err{n/2) 
err(T~\n) < err(T n \ n/2) + err (T 2 ' 2 \ n/2) + err{(T^ ■ T 12 ) • T 22 1 ,n/2) 

< 2err(n/2) 
+^n/2)e\\T- 1 \\-\\T\\-\\T- 1 \\ 

+ ( f i(n/2)e\\T- 1 \\ ■ \\T\\ + err(n/2)\\T\\) ■ WT^W 
+ \\T- 1 \\ ■ \\T\\ ■ err(n/2) 

< 2(k(T) + l)err(n/2) + 2^(n/2)eK,(T)\\T~ 1 \\ 

Solving the resulting recurrence for err(n) |20] [Thm. 4.1] yields 

err(n) = 2{n{T) + l)err(n/2) + 2u(n/2)eK(T)\\T- 1 \\ 
= 0(u(n/2)e K (T)(2(K(T) + l))^"||r- l ||) 

showing that the algorithm is logarithmically stable as claimed. 
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3.2 Recursive Dense Matrix Inversion 



A similar analysis may be applied to inversion of symmetric positive definite matrices using an analogous 
well-known divide-and-conquer formula: 
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where S = C- B T A~ X B 



A~ 



X A- 1 BS- 1 B T A- 1 
-S- 1 B T A- 1 



-A^BS- 1 

s- 1 



To proceed, we need to state the algorithm more carefully, also deriving a recurrence for the cost C(n) 

function Hi = Recur siveInv(H,n) ... invert n-by-n s.p.d. matrix H recursively 
if (n=l) then 
Hi = 1/H 



else 



Ai = RecursiveInv(A, n/2) 
AiB = Ai- B 
BAiB = B T ■ AiB 
S = C - BAiB 
Si = RecursiveInv(S, n/2) 
AiBSi = AiB ■ Si 
AiBSiBAi = AiBSi ■ (AiB) T 
Hin=Ai + AiBSiBAi 
return Hi = [[Hin, -AiBSi]; 



cndif 



cost = C(n/2) 
cost = MM (n/2) 
cost = MM (n/2) 
cost = (n/2) 2 
cost = C(n/2) 
cost = MM (n/2) 
cost = MM (n/2) 
cost = (n/2) 2 
[(-AiBSi) T ,Si}} 



Assuming MM(n) = 0(n J ) for some 2 < 7 < 3, it is easy to see that the solution of the cost recurrence 
C(n) = 2C(n/2) + AMM(n/2) + n 2 /2 = 0(MM(n)) as desired. 

For the rest of this section the matrix norm || • || will denote the 2-norm (maximum singular value). To 
analyze the error we exploit the Cauchy Interlace Theorem which implies first that the eigenvalues of A 
interlace the eigenvalues of H, so A can be no worse conditioned than H, and second that the eigenvalues 
of S^ 1 (and so of S) interlace the eigenvalues of H~ x (and so of H, resp.), so S can also be no worse 
conditioned than H. Letting A and A denote the smallest and largest eigenvalues of H, resp., we also 
get that II^A-^II < ||C|| + \\S\\ < 2A and \\A^ BS- 1 B T A^\\ < WA^W + || ^ — 1 1| < 2/A, all of which 
inequalities we will need below. 

As before, we use the induction hypothesis that err(n') bounds the error in the inverse of any n'-by-n' 
diagonal block computed by the algorithm (including the errors in computing the block, if it is a Schur 
complement, as well as inversion). In particular we assume err(A , n/2) < err (n/2). Then we get 



err(AiB,n/2) < u(n/2) 
err(BAiB,n/2) < fi(n/2) 



1- 

A 

A 



A 



A 2 

< 2n(n/2) ■ e ■ — 
A 



err(S, n/2) < 



e • A - 



■ err(n/2) ■ A 
- A • err(AiB,n/2) 
A 2 • err(n/2) 
err(BAiB,n/2) 



.using ((Sj) with no error in B 



.using ijHJl with no error in B 



« err(BAiB,n/2) 
err(Si, n/2) < err(n/2) + — ■ err(S, n/2) 

A 2 
A 3 



.using (S + 5S)- 1 w - S^SSS' 



< err(n/2) + 2\i(n/2) 



A 2 
A 2 



err(n/2) 
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err(AiBSi, n/2) < //(n/2) • e ■ — ■ \ + T • err(Si, n/2) + err(AiB, n/2) ■ \ 

AAA A 

. ;oN ,2A 2A 3 N ,2A A 3 . , 
< A*(n/2) ' e ' + —) + (— + as )c"-("/2) 



err(AiBS , iSv4z, n/2) < /i(n/2) • e 



1 A 
A ' A 



Y • err(AiB, n/2) + err(AiBSi, n/2) ■ ^ 
A A 



..using 



.using 



, . . ,2A 2A 2 2A 4 A 2A 2 A* , 
< M (n/2) • £ • ( ^ + -^3- + -^-) + ( j + -g- + ^)err(n/2) 



In 1 

err(Hi lu n/2) < \ j -e - - + err(A~ 1 , n/2) + err(AiBSiBAi,n/2) 
2 A 

,2A 



A 4 



... 2A 2 2A* , A 2A 2 
M"/2). £ .(^ + — + — ) + (! + _ + _ + _ 

err(Hi,n) < err(Hin,n/2) + err(AiBSi,n/2) + err(Si,n/2) 

A .A 2 A 3 A 4 



)err(n/2) 



< u(n/2) ■ e ■ (4— + 4— + 2— + 2—) + err(n/2) • (2 + 3 T 



,A 
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A 4j 



This yields a recurrence for the error, where we write k 

A A 2 A 3 A 4 
err(n) < u(n/2) ■ e ■ (4— + 4— + 2— + 2— ) - 



■ err(n/2) ■ (2 



4 



3^ 
A 2 



A 3 
A 3 



A 4J 



< 



12 • u(n/2) • e • — + 10 • k 4 • err(n/2) 
A 

Solving this recurrence, we get 

err(n) = 0{en(n)n 4 (10 k 4 ) 10 ^- "A" 1 ) = 0(ep(n)n I °*' 10 K 4 + 41o ^"||i/- 1 | 



(9) 



showing that recursive inversion of a symmetric positive definite matrix is logarithmically stable. 

To invert a general matrix we may use A -1 = A T ■ (A ■ A T ) _1 . Forming A ■ A T only squares A's condition 
number, and first order error analysis shows the errors contributed from the two matrix multiplications can 
only increase the exponent of k in ([9]) by doubling it and adding a small constant. Thus we may also draw 
the conclusion that general matrix inversion is logarithmically stable. The same reasoning applies to solving 
Ax = b by multiplying x — A T ■ (A ■ A T y 1 ■ b. 

Finally, we return to our claim in the introduction: 

Theorem 3.3. If we can multiply n-by-n matrices in 0(n w+ri ) arithmetic operations then we can invert 
matrices stably in 0(n (Aj+ri ) bit operations. Conversely, if we can invert matrices stably in 0(n" +7? ) bit 
operations (resp. exactly in 0(n UJ+ri ) arithmetic operations) then we can multiply matrices stably in 0(n^ +v ) 
bit operations (resp. exactly in 0(n w+rl ) arithmetic operations). 

Proof. We have just proven the first claim, where we rely on logarithmic stability of inversion to bound the 
number of bit operations. 

For the converse implications, we simply use 



" I A 





-l 


' I -A A-B 


I 


B 




I -B 




I 




I 



Clearly, inverting the matrix on the left exactly in 0(n UJ+ri ) arithmetic operations lets us extract the product 
A ■ B. Given only a logarithmically stable inversion routine, we can make the condition number near 1 by 
scaling A and B to have norms near 1, implying that the above block matrices are very well conditioned, 
and the inverse can be computed accurately without extra precision. □ 



8 



It is tempting to summarize this theorem by saying "matrix multiplication is possible in 0(n u+TI ) oper- 
ations if and only if stable inversion is," but the difference between counting bit operations and arithmetic 
operations requires a more careful statement (a bound on the number of arithmetic operations can be used 
to bound the number of bit operations, but not conversely, since bit operations may conceivably not organize 
themselves into easily recognized arithmetic operations). 

4 Simultaneous Speed and Backward Stability of QR and LU 

We show that QR decomposition can be implemented stably and as fast as matrix multiplication. We exploit 
the fact that linear equation solving and determinant computation as well as solving least squares problems 
can be reduced to QR decomposition to make the same statements about these linear algebra operations. 
Similar statements can be made about LU decomposition, under slightly stronger assumptions about pivot 
growth than the conventional algorithm. 

4.1 Fast and Stable QR Decomposition 

We now describe in more detail the following recursive variation of a conventional QR algorithm [5], which 
was presented in [57]. Let A be an n-by-m matrix with n > m. The function [R, W, Y] = QRR(A,n 7 m) 
will return an m-by-m upper triangular matrix _R, an n-by-m matrix W, and an m-by-n matrix Y with the 
following properties: (1) Q T = I — WY is an n-by-n orthogonal matrix, (2) each column of W has unit 
2-norm, (3) each row of Y has 2-norm equal to 2, (4) A = Q ■ [R; zeros(n — to, to)] is the QR decomposition 
of A (here and later we use MATLAB notation) . 

function [R, W, Y] = QRR(A) ... A is n-by-m, with n > to 
if (to = 1) then 

compute W and Y in the conventional way as a Householder transformation [34, sec. 19.1], 
with the normalization that ||W|| 2 = 1, \\Y\\ 2 = 2 and R = ±\\A\\ 2 

else 

(a) [R L , W L , Y L ] = QRR(A{\ ■ n, 1 : Lf J)) 

... compute QR decomposition of left half of A 

(b) A(l : n, Lf J + 1 : m) = A(l : n, Lf J + 1 : to) - W L ■ (Y L ■ A(l :n,[f\+l: to)) 

... multiply right half of A by Q T 

(c) [R R , W R ,Y R ] = QRR{A{[f\ + 1 : n, [f J + 1 : to)) 

... compute QR decomposition of right half of A 

(d) X = W L - [zer 0S (Lf J, Lf J); W R ■ (Y R ■ W L ([f J + 1 : n, 1 : Lf J)] 

... multiply two Q factors 
R=[[R L ,A(1 : Lf J, Lf J + 1 : m)]; [zeros(\f], [f\),R R ]] 
W=[X, [ Z eros(lf\,[f]);W R ]} 
Y = [Y L ;[zeros(\fl[f\),Y R }] 
endif 



The proof of correctness is induction based on the identity (I — [0; W^r][0, Y r ]) ■ (I — WlYl) = I — WY 
as in Section 2 above. For the complexity analysis we assume to is a power of 2: 

vn 

cost(n,m) — cost(n, — ) ...cost of line (a) 

I [ / ffl I It; I I t III 

+MM( — ,n,—)+MM(n,—,—) + n— ...cost of line (b) 

TTi TTt 

-\-cost(n , — ) ...cost of line (c) 

,171 TTI m N , 111 777 777 , , 171,171 , . 

+MM(-,n - -, -) + MM(n - -, -, -) + (n - -)- ...cost of line (d) 
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< 2cost(n,-) + 8-MM(-,-,-) + 0(nm) 

< 2cost(n, y ) + C^nm 7 " 1 ) 



...assuming 7 > 2 



When n = m, this means the complexity is 0(n 7 ) as desired. 

This algorithm submits to an analogous backward error analysis as in [9] or [34] [sec. 19.5], which we 
sketch here for completeness. 

Lemma 4.1. The output of[R,W,Y] = QRR(A) satisfies (I — WY + 6Q T ) (A + 5 A) = [R; zeros(n - m,m)] 
where Q T = I — WY + 6Q T satisfies QQ T = I exactly, \\SQ T \\ = 0(e), and \\SA\\ = 0(e)\\A\\. (Here we let 
0() absorb all factors of the form n c .) 

Proof. We use proof by induction. The base case (m = 1) may be found in [33] [sec. 19.3]. Let Al = A(l : 
n, 1 : LyJ) an d An = A(l : n, [^J + 1 : m )- From the induction hypothesis applied to step (a) of QRR we 
have 



(I - W L Y L + SQl)(A L + 5A L ) = [R L : 



<n-[-\,[-\)] with Qi^I-W L Y L + 5Qi , 



QlQ L = 7, \\SQl\\ = 0(e) and \\SA L \\ = 0(e)\\A\\. Application of error bound |T]) to step (b) yields 



.4 



R,new 



= A R - W L ■ (Y L ■ A R ) + SA RA = Q T L (A R + SA RA ) with SA RA = -Q L ■ 5Q T L -A R + Q L - SA RA 



so ||(L4r,i|| = 0(e)||A||. Write A R . new — [A Rt x] A R . 2 ] where A R .\ is LT-l"ky - rT~l • The induction hypothesis 
applied to step (c) yields 

(I - W R Y R + 5Q T r)(Ar, 2 + 8A R , 2 ) = [R R ; zeros(n - m, [y]] with Q T R = I - WrYr + 5Q T R 
Q^Qr = I: W^Q'rW = 0(e), and ||<L4r i2 || = Combining expressions we get 



Q T R 



Ql-(A + SA) 



Rl A Rf \ 
Rr 




where 



SA 



SA L ,6A R i+Q L 



zeros([f\,\f}) 
SA Rt2 



satisfies \\SA\\ = 0(e)\\A\\. Finally repeated application of bound {T]) to step (d) shows that X = X trU e- 
with \\SX\\ = O(e), W = Wtme + [SX,zeros(n, [f ])], and 



Q T ^ 



Ql 



1 

Ql 

1 
I-WrYr + SQI 

= I-W true Y + SQ T 
= I- WY + 6Q T 

with QQ T = I, \\5Q T \\ = 0(e) and \\SQ T \\ = 0(e) as desired. 



(I - W l Yl + SQl) 



-SX 



□ 



Armed with an A = QR decomposition, we can easily solve the linear system Ax = b stably via x = 
R~ 1 Q T b straightforwardly in another 0(n 2 ) operations, or solve a least squares problem stably. Furthermore 
dct(A) = (—1)™ Yli Ru m a l so easily computed. In summary, high speed and numerical stability are achievable 
simultaneously. 
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4.2 Fast and Stable LU Decomposition 

There is an analogous algorithm for LU decomposition [50]. However, in order to update the right half of the 
matrix after doing the LU decomposition of the left half, it appears necessary to invert a lower triangular 
matrix, namely the upper left corner of the L factor, whose inverse is then multiplied by the upper right 
corner of A to get the upper right corner of U . As described in the last section, triangular matrix inversion 
seems to be only logarithmically stable. However, because of pivoting, one is guaranteed that Lu = 1 
and \L{j\ < 1, so that k(L) is generally small. Thus as long as L is sufficiently well conditioned then LU 
decomposition can also be done stably and as fast as matrix multiplication. Now we sketch the details, 
omitting the implementation of pivoting, since it does not contribute to the complexity analysis: 

function [L, U] = LUR(A) ... A is n-by-m, with n > m 
if (m=l) then 

L = A/A(1), U = A(l) 

else 

(a) [L L ,U L ]=LUR(A(l:n,l:[f\)) 

... compute LU decomposition of left half of A 

(b) A(l : Lf J, Lf J + 1 : m) = (L L (1 Lf J))" 1 ■ A(l : Lf J, L?J + 1 : m); 

... update upper right corner of A 

(c) A([f J + 1 : n, [f J + 1 : m) = A([f J + 1 : n, [f J + 1 : m)- 

L L (LfJ+l:n,l: LfJ)-A(l: LfJ,LfJ+l:m); 
... update Schur complement 

(d) [L R , Ur] = LUR(A([f\ + 1 : n, Lf J + 1 : m)) 

... compute LU decomposition of right half of A 

(e) L=[L L ,[zeros(lf\,\f]);L R ]}; 

(f) U = [[U L ,A(1 : Lf J, Lf J + 1 : "»)]; [zeros(\f 1, Lf \),U R ]); 
endif 



For the complexity analysis we assume m is a power of 2 as before: 



Tfl 

cost(n,m) ~ cost(n, — ) ...cost of line (a) 

Tfl 

+0{MM(—)) ...cost of line (b) 

+MM(n--,-,-) + (n--)- ...cost of line (c) 

Tfi tn 

-\-cost(n — — ) ...cost of line (d) 

771 TL 771 777 777 777 

< 2cost(n, -) + 2-MM{-, -, — ) + 0{nm + MAf (-)) 

2 m 2 2 2 2 

< 2cost(n, y) +0{nm 1 ~ 1 ) 

= 0(rim 7_1 ) ...assuming 7 > 2 

When n = m, this means the complexity is 0(n 7 ) as desired. 

Now we establish backward stability under the assumption that L (and so every diagonal block of L) is 
sufficiently well conditioned (and its norm sufficiently close to 1) that the error in the computed matrix from 
step (b) is bounded in norm by 0(e||A||): 

Lemma 4.2. If L in the output of [L, U] — LUR(A) is sufficiently well conditioned, then L ■ U = A + 8 A 
where \\8A\\ = 0(e)\\A\\. (Here we let 0() absorb all factors depending on \\L\\, ||L _1 ||, and n. We also 
assume without loss of generality that the rows of A are in the correct pivot order.) 
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Proof. We use proof by induction. The base case (to = 1) is straightforward. Let Al = A(l : n, 1 : |_ifj)) 
L Ltl = L(l: LfJ,l: Lf J), L L>2 = L([% J + 1 : n, 1 : [f J), A R> i = A(l : Lf J , Lf J + 1 : m), and A fl , 2 = 
A(Lf J + 1 : n, Lf J +1 : to). Then from the induction hypothesis applied to step (a), L L ■ Ul = A L + SA L 
with ||<L4x|| = 0(e\\ A\\). From step (b) and the assumptions about L, the updated value of Ari is given by 

A' R ^=L- L \-A RA +SA' RA with ||«U' fl)1 || = 0( £ |L4||) . 

From step (c) the updated value of A R2 is given by 

A'r, 2 = A R ,2 - Ll, 2 • A' R}1 + 5A' R2 with \\SA' R J = 0(e\\A\\) . 

From the induction hypothesis applied to step (d) we get 

L R ■ U R = A' R2 + 6A R<2 with \\SA% 2 \\ = 0(e\\A\\) . 

Combining these results yields 

L ■ U = A + 8A with 8 A = [SA L , [L L ,iSA' Rtl ;SA' R2 + SA' R 2 ]] , 

so \\SA\\ = 0(e||A||) as desired. □ 

The assumption that L is sufficiently well-conditioned is a variation on the usual assumption that pivot 
growth is limited, since pivot growth is bounded by (with the norm depending on how pivot growth 

is measured), and ||£||i is at most n. 

4.3 Columnwise Backward Error 

The error analysis of conventional 0(n 3 ) algorithms for the QR and LU decomposition actually yield some- 
what stronger results than normwise backward stability: they are normwise backward stable column-by- 
column. This means, for example, that LU is the exact factorization of A + 5 A where the i-th column of 6 A 
is small in norm compared to the i-th column of A. As stated, our algorithm does not have this property, 
since the fast matrix multiplication algorithm can and probably will "smear" errors between columns. But 
there is a simple fix to avoid this, and get the same columnwise error bound as the standard algorithms: (1) 
Preprocess A by dividing each column by its norm (say the infinity norm); save the values of the norms for 
step (3). (2) Compute the fast QR (or LU) factorization of the scaled A. (3) Multiply the zth-column of R 
(or of U) by the norm of the i-th column of A. 

It is easy to see that this additional work costs only 0(n ), and makes the backward error in column i 
proportional to the norm of column i. 

More generally, one could improve bound |T]) cither to \C comp ^j — Cy-| < /x(n)e|| A(i, :)\\ ||S(:,j')|| or to 
|| C comp — C\ < /j.(n)e\\ \A\-\B\ || by appropriately scaling rows and/or columns of A and B before multiplying 
them, and unsealing C comp afterwards if necessary. 

5 Fast and Stable Randomized Rank Revealing URV 

We also show how to implement a rank revealing URV decomposition based on QR decomposition stably and 
fast; this will be required for solving eigenvalue problems in the next section. Our rank revealing algorithm 
will be randomized, i.e. it will work with high probability. As we will see in the next section, this is adequate 
for our eigenvalue algorithm. 

Given a (nearly) rank deficient matrix A, our goal is to quickly and stably compute a factorization 
A = URV where U and V are orthogonal and R is upper triangular, with the property that it reveals 
the rank in the following sense: Let o~\ > • • • > <j n be the singular values of A. Then (1) with high 
probability cr m i n (i?(l : r, 1 : r)) is a good approximation of oy, and (2) assuming there is a gap in the 
singular values (oy+i <C oy) and that R(l : r, 1 : r) is not too ill-conditioned, then with high probability 
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Cmax(5(r + 1 : n,r + 1 : n)) is a good approximation of o r +\. This is analogous to other definitions of 
rank-revealing decompositions in the literature [15l [35l [16j HI 023 [47] , with the exception of its randomized 
nature. 

The algorithm is quite simple (RURV may be read "randomized URV" ) 

function [U,R,V] = RURV(A) ... A is n-by-n 

generate a random matrix B whose entries are independent, identically distributed 
Gaussian random variables with mean and standard deviation 1 (i.i.d. N(0,1)) 

(a) [V, R] = QRR(B) ... V is a random orthogonal matrix 

(b) A = A ■ V T 

(c) [U, R] = QRR(A) 



Thus U ■ R = A = A ■ V T , so U ■ R- V = A. The cost of RURV is one matrix multiplication and two calls 
to QRR, plus 0(n 2 ) to form B, so 0(n u+TI ) altogether. The matrix V has Haar distribution |41j . i.e. it is 
distributed uniformly over the set of n-by-n orthogonal matrices. For information on efficient generation of 
such matrices, see [21 138]. 

It remains to prove that this is a rank-revealing decomposition with high probability (for simplicity we 
restrict ourselves to real matrices, although the analysis easily extends to complex matrices): 

Lemma 5.1. Let f be a random variable equal to the smallest singular value of an r-by-r submatrix of 
a Haar distributed random n-by-n orthogonal matrix. Assume that r is "large" (i.e, grows to oo as some 
function of n; no assumptions are made on the growth speed). Let a > be a positive constant. Then there 
is a constant c > 1 such that, as soon as r and n are large enough, 



Pr 



f< 



1 



< 



Proof. Recall that the n-by-n Haar distribution has the following important property: any column, as well 
as the transpose of any row, is uniformly distributed over the n — 1 unit sphere. As such, without loss of 
generality, we can restrict ourselves to the study of the leading r-by-r submatrix. 

Denote by V the orthogonal matrix, and let U = V(l : r, 1 : r). We will assume that V came from the 
QR factorization of a n-by-n matrix B of i.i.d. Gaussians (for simplicity, as in RURV), and thus V — BR -1 . 
Moreover, U — B(l : r, 1 : r)(R(l : r, 1 : r)) _1 . Therefore 

> 



> 



> 



^min^ 1 : 


r,l : 


r)) • a min ((i?(l 


a mm( B ( l 


r,l 


r)) 


Cmax(-R(l 


:r,l 


r)) ' 


a min (B(l 


:r,l 


r)) 


0max(-B(l 


: n, 1 


:r)) ' 




r,l : 


r)) 



\B(l:n,l:r)\\ F 



where || \\p denotes the Frobenius norm. 
Thus we shall have that 



f< 



1 



.a+l 



< Pr 



We now use the following bound: 



nin (£?(l : r, 1 : r)) 
\B(l:n,l:r)\\ F 



< 



1 



,a+l 



< Pr 



nin (i?(l : r, 1 : r)) 
\B(l:n,l:r)\\ F 



.(5(1 : r, 1 : r)) < 



< 



r a+l/2 



Pr [115(1 : n,l 



> 2v^n] 
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The limiting (asymptotical) distribution (as r — * oo) of r • o~ m [n(B(l : r, 1 : r)) 2 has been computed in 
Corollary 3.1 of [26] and shown to be given by 

f( x ) = I±^E e -(x/2+^) . 

2Va; 

The convergence was shown to be very fast; in particular there exists a (small) constant Cq such that 

Pr [r- 2 cr min (B(l :r,l:r))<x] < c [ f(t)dt < coVx , 

Jo 

for any x > 0. After the appropriate change of variables, it follows that there is a constant c\ such that 

Pr 



2 

cr min( B ( 1 -r,l:r))< 



r a+l/2 



for all r. 

On the other hand, the distribution of the variable \\B(1 : n,l : r)\\p is Xnr, with \ being the square 
root of the \ 2 variable. As the probability density function for \m is 

9rn{x) = 2J .„/2_i r f^rn^ X 2 1 e X /2 , 

and simple calculus gives the bound 

Pr : n, 1 : r)\\ F > < ; ( 12 ) 

for all r and n. 

From (JTUJ) , (fTTj) . and (fT2|) we obtain the statement of the lemma. □ 



Lemma 15.11 implies that, as long as r grows with n, the chance that / is small is itself small, certainly 
less than half, which is all we need for a randomized algorithm to work in a few trials with high probability. 

Theorem 5.2. In exact arithmetic, the R matrix produced by RURV(A) satisfies the following two condi- 
tions. First, 

f ■ o r < a min (R(l : r, 1 : r)) < y ct 2 + a 2 r+1 < V2 ■ a r , 

where f is a random variable equal to the smallest singular value of an r-by-r submatrix of a random n-by-n 
orthogonal matrix. Second, assuming o~ r +i < fo~r, 



f 

oy+i < cr max (i?(r + 1 : n, r + 1 : n)) < 3a r+1 



1 — r+1 



Given Lemma I5.il which says it is unlikely for f to be small, this means that if there is a large gap in the 
singular values of A (oy + i -C o~ r ) and R(l : r, 1 : r) is not too ill-conditioned (ay is not -C o~\), then with 
high probability the output of RURV is rank- revealing. 

Proof. In this proof \\Z\\ will denote the largest singular value of Z. Let A = P-Y,-Q T = P-diag(Si, E2)-Q T 
be the singular value decomposition of A, where Si = diag(cri, ...,oy) and S2 = diag(cr r +i, ...,cr n ). Let V T 
be the random orthogonal matrix in the RURV algorithm. Note that X = Q T ■ V T has the same probability 
distribution as V T . The intuition is simply that a randomly chosen subspace (spanned by the leading r 
columns of V T ) is unlikely to contain vectors nearly orthogonal to another r dimensional subspace (spanned 
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by the leading r columns of Q), i.e. that the leading r-by-r submatrix of X is unlikely to be very ill- 



conditioned. Write X = [Xl, X2] where X\ = 



X21 



has r columns, X2 — 



X12 
X22 



has n — r columns, 



Xu and X12 have r rows, and X21 and X22 have n — r rows. Then 



o- m in(-R(l : r, 1 : r)) = <r n 



51 ■ X11 

52 • X21 



> 0" m i n (Si ■ Xn) > a r ■ CTmm(Xl) = 0> • / 



where / = cr m i n (Xn) is a random variable with distribution described in Lemma |5. II Also 

<T^ n (i?(l : r, 1 : r)) = ^(Xf^Xn + 4s^ 21 ) < ^(X^Xu) + ^(4^2^21) < cr 2 r + a* +1 . 

Now let E • X = [E • X\, E • X2] = [Yi, Y2]. Then the nonzero singular values of R(r + 1 : n, r + 1 : n) are 
identical to singular values of the projection of Y2 on the orthogonal complement of the column space of Yi, 
namely of the matrix C = (I - Y 1 (Y^Y 1 )- 1 Y^)Y 2 . Write 

OO 

(YfYx)- 1 = {X^jX u + 42^21)^ = (5 - 5S)- 1 = S" 1 + ^ S'^S ■ S' 1 ) 1 

assuming the series converges. Assuming the product of of+i > ll^ll and of jt^t > 11^ II is l ess than 1, 
we have 



F = (Y^y 1 = (XfjS^Xn)- 1 + £ = X^E^Xi 



where \\E\\ < (^)/(l - ^J) and ||f|| < + ||£|| < 75^^ ■ Now we compute 
C 



{I -Y 1 {Y^Y 1 )- x Yl)Y 2 



I — Si XnYxT, Si 



11^1 — EiAnyA 2 r 1 E2 
—'E2X2iYX 11 'Ei I — E2A2iYA 2 i£2 

— ^iXnEXj 1 'Si — EiXliYX^£2 
— E2AT21YATHE1 / — E2X21YA21E2 



E1X2 
E2A22 





E1X2 




S2X22 



— E 1 X 11 ^X 11 E 1 X 12 
—T,2X2iYX-[ 1 'E<iXi2 



Thus ||C|| < HZill + 
HZill < l|Ei| 

and 



Zo 

\^\\-'r\\Z 3 \\ 



+ Z3 ■ E2A22 
E2M|X2||<||^l| 



— SiAnyA 2 iS2 
/ — E2X21YX21^2 



ay+i- Next, 



Y10X0 



\Xu\\ ■ \\E\\ ■ ||X 



|X 12 ||<^||E||< 



l_fltl 



+1 



1 _ _E±i 
1 /V? 



Il^all < ||S 2 || • IIX21II • ||Y|| • ||Xn|| • ||E?|| • ||X 12 || < a\a r+1 \\Y\\ < 
which is smaller than the bound on \\Zi\\, as is oy+i. Altogether, we then get 



2Z2 



'r+l 



\\C\\ < 3a r+ i 



1 Pol 



as desired. 



□ 
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Corollary 5.3. Suppose A has rank r < n. Then (in exact arithmetic) the r leading columns of U from 
[U,R,V] — RURV(A) span the column space of A with probability 1. 

Lemma 5.4. In the presence of roundoff error, the computed output [U, R, V] = RURV(A) satisfies 

A + SA = U ■ R-V where U and V are exactly orthogonal matrices, \\5A\\ = 0(e)\\A\\, \\U — U\\ = 0(e), and 

\\V-V\\=0(e). 

Proof. Applying Lemma H. II to step (a) yields V = V — SV where V ■ V T = I and \\8V\\ = 0(e), Applying 
error bound ((T|) to step (b) yields A = A ■ V T + 5Ai where ||<5^4i|| = 0(e||A||). Applying Lemma I4TT1 to step 
(c) yields U ■ R = A + SA 2 where U ■ U T = I, SU = U - U satishes \\SU\\ = 0(e), and ||<L4 2 || = 0(e\\A\\). 
Combining these identities yields 

U ■ R = (A - A ■ 5V T ■ V + (6A 1 + 5A 2 ) ■ V) ■ V T = (A + SA) ■ V T 

where ||<L4|| = 0(e||A||) as desired. □ 

Lemma shows that RURV(A) computes a rank revealing factorization of a matrix close to A, which 
is what is needed in practice (see the next section). (We note that merely randomizing the order of the 
columns of A is not good enough: consider the case of an n-by-n matrix of rank r where n — r + 1 columns 
are all multiples of one another.) The question remains of how to recognize success, that a rank-revealing 
factorization has in fact been computed. (The same question arises for conventional rank-revealing QR, with 
column pivoting, which can fail, rarely, on matrices like the Kahan matrix.) This will be done as part of the 
eigenvalue algorithm. 

6 Eigenvalue Problems 

To show how to solve eigenvalue problems quickly and stably, we use an algorithm from [S] , modified slightly 
to use only the randomized rank revealing decomposition from the last section. As described in Section [6.11 
it can compute either an invariant subspace of a matrix A, or a pair of left and right deflating subspaces of 
a regular matrix pencil A — XB, using only QRR, RURV and matrix multiplication. Applying it recursively, 
and with some assumptions about partitioning the spectrum, we can compute a (generalized) Schur form 
stably in 0(n^ +r >) arithmetic operations. Section W% discusses the special case of symmetric matrices and 
the singular value decomposition, where the previous algorithm is enough to stably compute eigenvectors 
(or singular vectors) as well in 0(n UJ+v ) operations. But to compute eigenvectors of a nonsymmetric matrix 
(or pencil) in 0(n^ +r >) operations is more difficult: Section T6.3I gives a logarithmically stable algorithm 
SylR to solve the (generalized) Sylvester equation, which Section 16.41 in turn uses in algorithm EVecR 
for eigenvectors. However, the EVecR is only logarithmically stable in a weak sense: the accuracy of any 
computed eigenvector may depend on the condition numbers of other, worse conditioned eigenvectors. We 
currently see no way to compute each eigenvector with an error proportional to its own condition number 
other than by the conventional 0(n 2 ) algorithm (involving the solution of a triangular system of equations), 
for a cost of 0(n 3 ) to compute all the eigenvectors this accurately. 

6.1 Computing (generalized) Schur form 

The stability and convergence analysis of the algorithm is subtle, and we refer to [5] for details. (Indeed, 
not even the conventional Hessenberg QR algorithm has a convergence proof 21j.) Our goal here is to show 
that the algorithm in [S] can be implemented as stably as described there, and in 0(n u+?J ) operations. The 
only change required in the algorithm is replacing use of the QR decomposition with pivoting, and how 
rank is determined: instead we use the RURV decomposition, and determine rank by a direct assessment 
of backward stability described below. Since this only works with high probability, we will have to loop, 
repeating the decomposition with a different random matrix, until a natural stopping criterion is met. The 
number of iterations will be low since the chance of success at each step is high. 
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Q T AQ 



Rather than explain the algorithm in [5 in detail, we briefly describe a similar, simpler algorithm in order 
to motivate how one might use building blocks like matrix multiplication, inversion, and QR decomposition 
to solve the eigenvalue problem. This simpler algorithm is based on the matrix sign-function |44j : Suppose A 
has no eigenvalues with zero imaginary part, and let A = S ■ diag( J + , J_ ) ■ S~ 1 be its Jordan Canonical form, 
where J+ consists of the Jordan blocks for the r eigenvalues with positive real part, and J_ for the negative 
real part. Then define sign(A) = S ■ diag(+/ r , — 7„_ r ) • S^ 1 . One can easily see that P + = |(sign(v4.) + I) = 
S ■ diag(+/ r , 0) • S" 1 is the spectral projector onto the invariant subspace S+ of A for J + . Now perform a 
rank revealing QR factorization of P+, yielding an orthogonal matrix Q whose leading r columns span 5+. 
Therefore 

" A n A12 
A 22 

is a block Schur factorization: it is orthogonally similar to A, and the r-by-r matrix An has all the eigenvalues 
with positive imaginary part, and A22 has all the eigenvalues with negative imaginary part. 

We still need a method to compute sign(A). Consider Newton's method applied to find the zeros of 
f(x) = x 2 — 1, namely Xi+% — \{x{ + x^). Since the signs of the real parts of X{ and a;" 1 , and so of aft+ii 
are identical, Newton can only converge to sign(3?a;o). Global and eventual quadratic convergence follow by 
using the Cayley transform to change variables to £i = frqrp This both maps the left (resp. right) half plane 
to the exterior (resp. interior) of the unit circle, and Newton to ii+i = xf, whose convergence is apparent. 
We therefore use the same iteration for matrices: = \{Ai + A^ 1 ) (see 44 ). One can indeed show this 

converges globally and ultimately quadratically to sign(yl ). This reduces computing the sign function to 
matrix inversion. 

To compute a more complete Schur factorization, we must be able to apply this algorithm recursively to 
An and A 2 2, and so divide their spectra elsewhere than along the imaginary axis. By computing a Moebius 
transformation A — {a A + 131) ■ (jA + SI) , we can transform the imaginary axis to an arbitrary line or 
circle in the complex plane. So by computing a rank-revealing QR decomposition of ^(sign(j4) + I), we 
can compute an invariant subspace for the eigenvalues inside or outside any circle, or in any halfspace. To 
automate the choice of these regions, one may use 2-dimensional bisection, or quadtrees, starting with a 
rectangle guaranteed to contain all eigenvalues (using Gershgorin bounds), and repeatedly dividing it into 
smaller subrectangles, stopping division when the rectangle contains too few eigenvalues or has sufficiently 
small perimeter. 

The same ideas may be extended to the generalized eigenvalue problem, computing left and right deflating 
subspaces of the regular matrix pencil A — XB. 

Backward stability (though not progress) may be guaranteed by checking whether the computed A21 in 
An A 12 
A21 A22 

eigenvalue algorithms based on the matrix sign function. 

The algorithm in [5], which in turn is based on earlier algorithms of Bulgakov, Godunov and Malyshev 
[251 IT51 1551 [551 |4"U] . avoids all matrix inverses, either to compute a basis for an invariant (or deflating) 
subspace, or to split the spectrum along an arbitrary line or circle in the complex plane. It only uses QR 
decompositions and matrix multiplication to compute a matrix whose columns span the desired invariant 
subspace, followed by a rank-revealing QR decomposition to compute an orthogonal matrix Q whose leading 
columns span the subspace, and an upper triangular matrix R whose rank determines the dimension of the 
subspace. Stability is enforced as above by checking to make sure A21 is sufficiently small in norm, rejecting 
the decomposition if it is not. 

To modify this algorithm to use our RURV decomposition, we need to have a loop to repeat the RURV 
decomposition until it succeeds (which takes few iterations with high probability), and determine the rank 
(number of columns of A21). Rather than determine the rank from i?, we choose it to minimize the norm of 
A21, by computing ||A 2 i||,s = J2ij \A2i,ij\ for all possible choices of rank, in just 0(n 2 ) operations: 

repeat 

compute Q from RURV 
compute A = Q T AQ 



Q T AQ 



is sufficiently small in norm. See [5J 21 G2E H2] for a more complete description of 
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for i = 1 to n — 1 

ColSum(z) = li^j 

Rowsum(i + 1) = Z)j=i 
endfor 

NormA21(l) = ColSum(l) 
for i = 2 to n — 1 

NormA21(i) = NormA21(i - 1) + ColSum(i) - RowSum(i) 
... NormA21(i) = \\A(i + 1 : n, 1 : i)\\s 

endfor 

Let r be the index of the minimum value of NormA21(l : n — 1) 
until NormA21(r) small enough, or too many iterations 

We need the clause "too many iterations" in the algorithm to prevent infinite loops, and to account for 
the possibility that the algorithm was unable to split the spectrum at all. This could be because all the 
eigenvalues had positive imaginary part (or were otherwise on the same side of the dividing line or circle), 
or were too close to the dividing line or circle that the algorithm used. The error analysis of the algorithm 
in [5] is otherwise identical. 

Finally we need to confirm that the overall complexity of the algorithm, including updating the Schur form 
and accumulating all the orthogonal transformations to get a complete Schur form A = QTQ T , costs just 
0(n w+T1 ). To this end, suppose the original matrix A is n-by-n, and the dimension of a diagonal submatrix A 
encountered during divide- and-conquer is n. Then computing an n-by-n Q to divide A once takes 0(n UJ+ ' n ) 
operations as described above, applying Q to the rest of A costs at most 2 • MAl(n,h 7 n) = 0(nn" +r ' _1 ) 
operations, and accumulating Q into the overall Q costs another MM(n,n,n) = 0(nn" +, '~ 1 ). Letting 
cost(n) be the cost of all work associated with A then yields the recurrence 

cost(n) = 2cost(^) + 0(rT + ") + 0{nn u+r >- x ) = 2cost(^) + 0{nn u+r ^- x ) 

whose solution is cost(n) — 0(rm" J+?, ~ :L ) or cost(n) = 0(n li ' + '') as desired. 



6.2 Symmetric Matrices and the SVD 

When the matrix is symmetric, or is simply known to have real eigenvalues, simpler alternatives to the matrix 
sign function are known that only involve matrix multiplication [jj 1371 142] , and to which the techniques 
described here may be applied. Symmetry can also be enforced stably in the algorithm described above by 
replacing each computed matrix Q T AQ by its symmetric part. The bisection technique described above to 
locate eigenvalues of general matrices obviously simplifies to 1-dimensional bisection of the real axis. 



The SVD of A can be reduced to the symmetric eigenproblem either (1) for ^ T „ or (2) for AA T 

and A 1 A. But in either case, the notion of backward stability for the computed singular vectors would have 
to be modified slightly to account for possible difficulties in computing them. We can avoid this difficulty, and 
get a fully backward stable SVD algorithm, by separately computing orthogonal Ql and Qr whose leading 
r columns (nearly) span a left (resp. right) singular subspace of A, and forming Q T L AQ R . This should be 
(nearly) block diagonal, letting us continue with divide-and-conquer. Ql (resp. Qr) would be computed by 
applying our earlier algorithm to compute an orthogonal matrix whose leading r columns span an eigenspace 
of AA T (resp. A T A, for the same subset of the spectrum). Stability despite squaring A requires double 
precision, a cost hidden by the big-0 analysis. The algorithm would also check to see if Q\AQr is close 
enough to block diagonal, for any block size r, to enforce stability. 
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6.3 Solving the (generalized) Sylvester Equation 



To compute an invariant subspace of 
R 

^ we are lead to the equation 



A C 
B 



for the eigenvalues of B and spanned by the columns of 



A C 
B 





R 




R 




I 




I 



B 



(13) 



or the Sylvester equation AR — RB = —C to solve for R. When A and B are upper triangular as in Schur 
form, this is really a permuted triangular system of equations for the entries of R, where the diagonal entries 



of the triangular matrix are all possible differences Ay. 
the eigenvalues of A and B are distinct. 



Bjj, so the system is nonsingular precisely when 



Similarly, to compute a right (resp. left) deflating subspace of 



" A 


c ' 


-X 


' A 


C ' 





B 





B 



for the eigenvalues 



of B — XB and spanned by the columns of 



R 
I 



(resp. 



L 
I 



) we are led to the equations 



A C 
B 





R 




L 




I 




I 



B and 



A C 
B 





R 




L 




I 




I 



■ B 



or the generalized Sylvester equation AR — LB = —C, AR — LB = —C to solve for R and L. 
To derive a divide-and-conquer algorithm for the Sylvester equation, write it as 



An 




A 12 
A22 



R11 
R21 



R12 
R22 



R11 R12 
R21 R22 







B12 
B22 



On C12 
G'21 C22 



where all submatrices are dimensioned conformally. Multiplying this out we get the four equations 



A22R21 — R21B11 
A11-R11 — RiiBn 
A22R22 ~ R22B22 
A11R12 — R12B22 



—C21 
-C n 
— C22 

— C\2 



A12R2I 
R21B12 
Rl\B\2 



A12R2 



(14) 



(15) 
(16) 
(17) 
(18) 



We recognize this as four smaller Sylvester equations, where (fT5j) needs to be solved first for i?2i, which lets 
us evaluate the right hand sides of (fl6|) and (fT7|) . and finally (fT8|) . 

This idea is captured in the following algorithm, where for simplicity we assume all matrices are square 
and of the same dimension, a power of 2: 



function R = SylR(A, B, C) 
if (n — 1) then 



all matrices are n-by-n 





R = 


~C/{A-B) 






else 






(a) 


-R21 


= SylR(A 22 ,B n ,C 21 ) 




(b) 


R11 


= SylR(A n ,B 11 ,C n + 


A12R21) 


(c) 


R22 


= SylR{A 22 ,B 22 ,C 22 - 


R21B12) 


(d) 


R12 


= Syi?(Aii,5 2 2,Ci2- 


R11B12 ■ 



.. use notation from equation ([14")) 
... solve iQSJ) 
... solve HH) 
... solve Ull) 
... solve (ELl) 



end 

If the matrix multiplications in SylR were done using the conventional 0(n 3 ) algorithm, then SylR would 
perform the same arithmetic operations (and so make the same rounding errors) as a conventional Sylvester 
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solver, just in a different order. For the complexity analysis of SylR we assume n is a power of 2: 



cost(n) — cost (n/2) ...cost of line (a) 

+MM(n/2) + cost(n/2) ...cost of line (b) 

+MM(n/2) + cost(n/2) ...cost of line (c) 

+2 • MM(n/2) + cost(n/2) ...cost of line (d) 
= 4- cost(n/2) + 0(n UJ+rl ) 

= 0(n w+n ) ...as long as u + r\ > 2 

Proving logarithmic stability will depend on each subproblem having a condition number bounded by 
the condition number of the original problem. Since the Sylvester equation is really the triangular linear 
system (J(g> A — B T ® I) ■ vec(R) — — vec(C), where (8 is the Kronecker product and vec(R) is a vector of the 
columns of R stacked atop one another from left to right, the condition number of the Sylvester equation is 
taken to be the condition number of this linear system. This in turn is governed by the smallest singular 
value of the matrix, which is denoted sep(A, B): 

sep(A,B) = <j min {I ® A- B T ® I) = min \\AR-RB\\ F 

\\R\\f— l 

where || • \\p is the Frobenius norm [51j . Just as in Section [3. 11 where each subproblem involved inverting a 
diagonal block of the triangular matrix, and so had a condition number bounded by the original problem, 
here each subproblem also satisfies 

-<■!>•:. 1„. /-'..J > sep(A,B) . (19) 

This lets us prove that this algorithm is logarithmically stable as follows. Similar to before, we use 
the induction hypothesis that err(n') bounds the error in the solution of any smaller Sylvester equation of 
dimension n' encountered during the algorithm (including errors in computing the right-hand-side) . We use 
the fact that changing the right-hand-side of a Sylvester equation by a matrix bounded in norm by x can 
change the solution in norm by x/sep(A, B), as well as error bound ©: 

err(R 2 \ 1 n/2) < err(n/2) 

err(R n ,n/2) < err(n/2) + 7^r(e| C n \\ + Pdl ■ err{R 2U n/2) + n{n/2)e\\ A 12 \\ ■ \\R 21 \\) 

sep(A, rs j 

< err(n/2) + ^7^m (g| ^11 + Pll ' err(n/2) + fi(n/2)e\\A\\ ■ \\R\\) 

err(R 22 ,n/2) < err{n/2) + {e\ C 22 \\ + \\B 12 \\ ■ err(R 2U n/2) + M (n/2)e||5 12 || ■ ||i? 21 ||) 

M'p; .1. B) 

< err(n/2) + gep( ^ B) (g| C\\ + \\B\\ ■ err(n/2) + ^n/2)e\\B\\ ■ \\R\\) 

err{R 12l n/2) < err{n/2) + ^ ^ (g| C 12 \\ + \\B 12 \\ ■ err(R n ,n/2) + ^(n/2)e\\B 12 \\ ■ \\R n \\ 
+ \\A 12 \\ ■ err{R 22 ,n/2) + fi(n/2)e\\A 12 \\ ■ \\R 22 \\) 

< err(n/2) + C|| + (ll^ll + \\B\\) ■ err(n/2) + ^n/2)e(\\A\\ + \\B\\) ■ \\R\\) 
err(R, n) < err(Ru,n/2) + err(Ri 2 ,n/2) + err(R 2 i, n/2) + err(R 22 , n/2) 

< L + 2 ML+igf ) . err{n / 2 ) + £ (3\\C\\ + 2^(n/2)(\\A\\ + \\B\\)\\R\\) 
\ sep(A,B) J sep(A,B) 
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This yields the recurrence 

err(n) = ^4 + 2 
< 0( 



\B\ 



sep(A,B 
ne 



err(n/2) 



sep(A,.B) 



(3\\C\\+2»(n/2)(\\A\\ 



\B\\)\\R\ 



sep(A, B 
< O I 3 fj,(n/2)e\\R\ 



(||C|| +n(n/2)(\\A\\ + ||^||)||i?||)(2+ M 7 tM ) 1 ^") 



sep(A,B) 



\B\ 



sep(A, B) 



l + logo ™ N 



(20) 
(21) 



Since the error bound of a conventional algorithm for the Sylvester equation is 0(e||-R| 



we see our 



new algorithm is logarithmically stable. 

A similar approach works for the generalized Sylvester equation, which we just sketch. Equation (|14|) 
becomes 



An 


Al2 ' 




Rn 


R\2 




' ill 


L\2 






Bu 




' C u 


C\2 





A 22 _ 




R21 


R22 




L21 


L22 







B22 




C21 


C22 


An 


A12 ' 




Rn 


R12 




' ill 


Ll2 




' B u 


B\2 




' Cxi 


C12 





A22 




R21 


R22 




L21 


L22 







B22 




C21 


C22 



Multiplying these out leads to four generalized Sylvester equations, one for each subblock, which are solved 
consecutively and recursively as before. 

6.4 Computing (generalized) Eigenvectors 

Given a matrix in Schur form, i.e. triangular, each eigenvector may be computed in 0(n 2 ) operations by 
solving a triangular system of equations. But this would cost 0(n 3 ) to compute all n eigenvectors, so we 
need a different approach. Rewriting equation (|13p as 



T = 



' A 


C ' 




' I 


R ' 




' A 







' I 


R ' 





B 







I 







B 







I 



-I -1 



(22) 



we see that we have reduced the problem of computing an eigendecomposition of T to finding eigendecomposi- 
tions of i = VaAaV^ 1 and B = Vb^bV^ separately, and then multiplying 

i r 1 r v A 

i" J ' { V B 

algorithm, where as before we assume all submatrices are square, of the same power-of-2 dimensions: 



to get the eigenvector matrix of T. This leads to the following divide-and-conqucr 



(a) 
(b) 
(c) 

(d) 

(e) 



function V = EVecR(T) ... all matrices are n-by-n 
if (n = 1) then 
V = 1 

else ... use notation from equation 
R = SylR(A,B,C) 
V A = EVecR(A) 
V B = EVecR(B) 
V A R-V b 
V B 

for i = n/2 + 1 to n, V(:,i) = V(:, i)/\\V(:, i)\\ 2 , end for 
end if 



V = 
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For the complexity analysis we assume as before that n is a power of 2: yielding 



cost{n) = 0(n"+'') ...cost of line (a) 

+2 • cost (n/2) ...cost of lines (b) and (c) 
+MM(n/2) ...cost of line (d) 

+0(n 2 ) ...cost of line (e) 

= 2 • cost(n/2) + 0(ri w+r ') 

= 0(n w+ ") as desired. 

In general, each eigenvector has a different condition number, and ideally should be computed with a 
corresponding accuracy. If we compute each eigenvector separately using the conventional algorithm in 
0(n 2 ) operations, this will be the case, but it would take 0(n 3 ) operations to compute all the eigenvectors 
this way. 

Unfortunately, EVecR cannot guarantee this accuracy, for the following reason. If the first splitting of 
the spectrum is ill-conditioned, i.e. sep(A, B) is tiny, then this will affect the accuracy of all subsequently 
computed right eigenvectors for _B, through the multiplication R ■ Vb- Indeed, multiplication by R can 
cause an error in any eigenvector of B to affect any other eigenvector. (It would also affect the left (row) 
eigenvectors [Vj[ , —V^ 1 ■ R] for A, if we computed them.) This is true even if some of B's eigenvectors are 
much better conditioned. Similarly, further splittings within A (resp. B) will effect the accuracy of other 
right eigenvectors of A (resp. B), but not of B (resp. A), 

To simplify matters, we will derive one error bound that works for all eigenvectors, which may overestimate 
the error for some. || • |j will denote the 2-norm. We will do this by using a lower bound s > for all the 
values of sep(^4, B) encountered during the algorithm. Thus s could be taken to be the minimum value of 
all the sep(^4, B) values themselves, or bounded below by mini<j <n scp(T(l : i, 1 : i), T(i + 1 : n, i + 1 : n)), 
which follows from inequality (|19p . We can also use \\R\\ < \\T\\/s as a bound on the norm of any R at any 
stage in the algorithm. We use this to simplify bound (|2H| on the error of solving the Sylvester equation, 
yielding 

(Wmy +, " ! '") (23) 



for a modest constant c. Using the induction hypothesis that err(n') bounds the error in the computed 
n'-by-n' eigenvector matrix at any stage in the algorithm, as well as bound l|8|). we get 



\ 2+log 2 n\ 



err(R, n/2) = O \ n c e 

err(VA,n/2) < err(n/2) 
err(VB,n/2) < err(n/2) 

err(V,n) < err{V A , n/2) + err(V B , n/2) + err(R, n/2)||Vb|| + \\R\\err(V B ) + n(n/2)e\\R\\ ■ \\V A \\ 
< (2 + \\R\\)err(n/2) + • err(R, n/2) + ^/nfi(n/2)e\\R\\ 



< (2+\\R\\)err(n/2) + n c e 



lyll \ 2 + lo S2 ™ N 



for another modest constant d . Step (e) of EVecR, which makes each column have unit norm, will decrease 
the absolute error in large columns, but we omit this effect from our bounds, which are normwise in nature, 
and so get a larger upper bound. Bounding 2 + ||i?|| < 3-^, and changing variables from n = 2 m to m and 
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err(n) to err(m), we get 

IITII / 
errim) < 3 • err(m — 1) + O £ 

V 

Finally, setting f(m) — efr(m)/(3^^) m , we get a simple geometric sum 

f(m) </(m-l) + 

which, since 2 C > 3, leads to our final bound 

err(n) — O ^n c e 
demonstrating a form of logarithmic stability. 

7 Conclusions 

We have shown that nearly all standard dense linear algebra operations (LU decomposition, QR decomposi- 
tion, matrix inversion, linear equation solving, solving least squares problems, computing the (generalized) 
Schur form, computing the SVD, and solving (generalized) Sylvester equations) can be done stably and 
asymptotically as fast as the fastest matrix multiplication algorithm that may ever exist (whether the ma- 
trix multiplication algorithm is stable or not) . For all but matrix inversion and solving (generalized) Sylvester 
equations, stability means backward stability in a normwise sense, and we measure complexity by counting 
arithmetic operations. 

For matrix inversion and solving the Sylvester equation, stability means forward stability, i.e. that the 
error is bounded in norm by 0(e ■ k), machine epsilon times the appropriate condition number, just as for a 
conventional algorithm. The conventional matrix inversion algorithm is not backward stable for the matrix 
as a whole either, but requires a different backward error for each column. The conventional solution of the 
Sylvester equation is not backward stable either, and only has a forward error bound. 

Also, for matrix inversion and solving the Sylvester equation, we measure complexity by counting bit 
operations, to account for the use of a modest amount of extra precision. Indeed, we can say that matrix 
multiplication (stable or not) in 0(n w+7? ) operations for any r] > is possible if and only if forward stable 
inversion of an n-hy-n matrix A in 0(n UJ+v ) operations for any 7/ > is possible. (See Theorem 13.31 for a 
more careful statement of how operations are counted.) 

All eigenvectors of (generalized) nonsymmetric eigenvalue problems can also be computed in 0(n u+ri ) 
bit operations from the Schur form, but with a weaker notion of forward stability, where the error bound for 
all the eigenvectors, in the worst case, depends on the largest condition number of any eigenvector. 

Finally, we note several possible practical implications of our algorithms. Several of the recursive al- 
gorithms we used (QRR and LUR) were originally invented for their superior memory locality properties 
[271 150] , and the same property is likely to hold for our new recursive algorithms as well. The divide-and- 
conquer nature of these algorithms also naturally creates parallelism that could be exploited on various 
architectures. Even if one is not using an asymptotically faster matrix-multiplication algorithm, these algo- 
rithms could be advantageous on platforms that perform matrix multiplication much faster than other basic 
linear algebra subroutines. 
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